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Abstract. Regularization methods allow one to handle a variety of in- 
ferential problems where there are more covariates than cases. This 
allows one to consider a potentially enormous number of covariates 
for a problem. We exploit the power of these techniques, supersatu- 
rating models by augmenting the "natural" covariates in the problem 
with an additional indicator for each case in the data set. We attach 
a penalty term for these case-specific indicators which is designed to 
produce a desired effect. For regression methods with squared error loss, 
an l\ penalty produces a regression which is robust to outliers and high 
leverage cases; for quantile regression methods, an £2 penalty decreases 
the variance of the fit enough to overcome an increase in bias. The 
paradigm thus allows us to robustify procedures which lack robustness 
and to increase the efficiency of procedures which are robust. 

We provide a general framework for the inclusion of case-specific 
parameters in regularization problems, describing the impact on the 
effective loss for a variety of regression and classification problems. 
We outline a computational strategy by which existing software can 
be modified to solve the augmented regularization problem, providing 
conditions under which such modification will converge to the optimum 
solution. We illustrate the benefits of including case-specific parame- 
ters in the context of mean regression and quantile regression through 
analysis of NHANES and linguistic data sets. 
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leverage point, outlier, penalized method, quantile regression. 
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1. INTRODUCTION 

A core part of regression analysis involves the ex- 
amination and handling of individual cases (Weis- 
berg, 2005). Traditionally, cases have been removed 
or downweighted as outliers or because they exert 
an overly large influence on the fitted regression 
surface. The mechanism by which they are down- 
weighted or removed is through inclusion of case- 
specific indicator variables. For a least-squares fit, 
inclusion of a case-specific indicator in the model 
is equivalent to removing the case from the data 
set; for a normal-theory, Bayesian regression analy- 
sis, inclusion of a case-specific indicator with an ap- 
propriate prior distribution is equivalent to inflating 
the variance of the case and hence downweighting 
it. The tradition in robust regression is to handle 
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the case-specific decisions automatically, most often 
by downweighting outliers according to an iterative 
procedure (Huber, 1981). 

This idea of introducing case-specific indicators 
also applies naturally to criterion based regression 
procedures. Model selection criteria such as AIC or 
BIC take aim at choosing a model by attaching 
a penalty for each additional parameter in the model. 
These criteria can be applied directly to a larger 
space of models — namely those in which the covari- 
ates are augmented by a set of case indicators, one 
for each case in the data set. When considering in- 
clusion of a case indicator for a large outlier, the 
criterion will judge the trade-off between the empir- 
ical risk (here, negative log-likelihood) and model 
complexity (here, number of parameters) as favor- 
ing the more complex model. It will include the case 
indicator in the model, and, with a least-squares 
fit, effectively remove the case from the data set. 
A more considered approach would allow differen- 
tial penalties for case-specific indicators and "real" 
covariates. With adjustment, one can essentially re- 
cover the familiar i-tests for outliers (e.g., Weisberg, 
2005), either controlling the error rate at the level 
of the individual test or controlling the Bonferroni 
bound on the family wise error rate. 

Case-specific indicators can also be used in con- 
junction with regularization methods such as the 
LASSO (Tibshirani, 1996). Again, care must be taken 
with details of their inclusion. If these new covari- 
ates are treated in the same fashion as the other 
covariates in the problem, one is making an implicit 
judgment that they should be penalized in the same 
fashion. Alternatively, one can allow a second pa- 
rameter that governs the severity of the penalty for 
the indicators. This penalty can be set with a view 
of achieving robustness in the analysis, and it allows 
one to tap into a large, extant body of knowledge 
about robustness (Huber, 1981). 

With regression often serving as a motivating 
theme, a host of regularization methods for model 
selection and estimation problems have been devel- 
oped. These methods range broadly across the field 
of statistics. In addition to traditional normal-theory 
linear regression, we find many methods motivated 
by a loss which is composed of a negative log-likeli- 
hood and a penalty for model complexity. Among 
these regularization methods are penalized linear re- 
gression methods [e.g., ridge regression (Hoerl and 
Kennard, 1970) and the LASSO], regression with 
a nonparametric mean function, [e.g., smoothing 
splines (Wahba, 1990) and generalized additive mod- 



els (Hastie and Tibshirani, 1990)], and extension to 
regression with nonnormal error distributions, name- 
ly, generalized linear models (McCullagh and Nelder, 
1989). In all of these cases, one can add case-specific 
indicators along with an appropriate penalty in or- 
der to yield an automated, robust analysis. It should 
be noted that, in addition to a different severity for 
the penalty term, the case-specific indicators some- 
times require a different form for their penalty term. 

A second class of procedures open to modification 
with case-specific indicators are those motivated by 
minimization of an empirical risk function. The risk 
function may not be a negative log-likelihood. Quan- 
tile regression (whether linear or nonlinear) falls into 
this category, as do modern classification techniques 
such as the support vector machine (Vapnik, 1998) 
and the ^-learner (Shen et al., 2003). Many of these 
procedures are designed with the robustness of the 
analysis in mind, often operating on an estimand 
defined to be the population-level minimizer of the 
risk. The procedures are consistent across a wide 
variety of data-generating mechanisms and hence 
are asymptotically robust. They have little need of 
further robustification. Instead, scope for bettering 
these procedures lies in improving their finite sample 
properties. The finite sample performance of many 
procedures in this class can be improved by includ- 
ing case-specific indicators in the problem, along 
with an appropriate penalty term for them. 

This paper investigates the use of case-specific 
indicators for improving modeling and prediction 
procedures in a regularization framework. Section 2 
provides a formal description of the optimization 
problem which arises with the introduction of case- 
specific indicators. It also describes a computational 
algorithm and conditions that ensure the algorithm 
will obtain the global solution to the regularized 
problem. Section 3 explains the methodology for 
a selection of regression methods, motivating partic- 
ular forms for the penalty terms. Section 4 describes 
how the methodology applies to several classifica- 
tion schemes. Sections 5 and 6 contain simulation 
studies and worked examples. We discuss implica- 
tions of the work and potential extensions in Sec- 
tion 7. 

2. ROBUST AND EFFICIENT MODELING 
PROCEDURES 

Suppose that we have n pairs of observations de- 
noted by (xi,i/i), i = 1, . . . , n, for statistical model- 
ing and prediction. Here Xi = (xn, . . . ,Xi p ) T with p 
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covariates and the y^s are responses. As in the stan- 
dard setting of regression and classification, the yi 's 
are assumed to be conditionally independent given 
the XiS. In this paper, we take modeling of the data 
procedure of finding a functional relationship 
between Xi and yi, f(x;/3) with unknown parame- 
ters (3 £ MP that is consistent with the data. The 
discrepancy or lack of fit of / is measured by a loss 
function C(y, f(x; (3)). Consider a modeling proce- 
dure, say, Ai of finding / which minimizes (n times) 
the empirical risk 

n 
i=l 

or its penalized version, R n (f) + A J(f) = Yl?=i ^{Vh 
f(xi][3)) + AJ(/), where A is a positive penalty pa- 
rameter for balancing the data fit and the model 
complexity of / measured by J(f). A variety of 
common modeling procedures are subsumed under 
this formulation, including ordinary linear regres- 
sion, generalized linear models, nonparametric re- 
gression, and supervised learning techniques. For 
brevity of exposition, we identify / with (3 through 
a parametric form and view </(/) as a functional 
depending on (3. Extension of the formulation pre- 
sented in this paper to a nonparametric function / 
is straightforward via a basis expansion. 

2.1 Modification of Modeling Procedures 

First, we introduce case-specific parameters, 7 = 
(71, . . . , 7n) T , for the n observations by augmenting 
the covariates with n case-specific indicators. For 
convenience, we use 7 to refer to a generic element 
of 7, dropping the subscript. Motivated by the ben- 
eficial effects of regularization, we propose a general 
scheme to modify the modeling procedure Ai using 
the case-specific parameters 7, to enhance Ai for 
robustness or efficiency. Define modification of Ai 
to be the procedure of finding the original model 
parameters, j3, together with the case-specific pa- 
rameters, 7, that minimize 

n 

(1) 

+ A /3 J(/) + A 7 J 2 (7). 

If A^ is zero, Ai involves empirical risk minimiza- 
tion, otherwise penalized risk minimization. The ad- 
justment that the added case-specific parameters 
bring to the loss function C(y , f (x; /?)) is the same 
regardless of whether A^ is zero or not. 



In general, ^(7) measures the size of 7. When 
concerned with robustness, we often take ^(7) = 
H7II1 = Y^=i |Ti I - A rationale for this choice is that 
with added flexibility, the case-specific parameters 
can curb the undesirable influence of individual cases 
on the fitted model. To see this effect, consider mini- 
mizing L(/3, 7) for fixed /3, which decouples to a min- 
imization of C{yi, f{xi\ p) + 7i) + A 7 |7i| for each 7$. 
In most cases, an explicit form of the minimizer 7 

of L(/3,7) can be obtained. Generally 7j's are large 
for observations with large "residuals" from the cur- 
rent fit, and the influence of those observations can 
be reduced in the next round of fitting [3 with the 7- 
adjusted data. Such a case-specific adjustment would 
be necessary only for a small number of potential 
outliers, and the l\ norm which yields sparsity works 
to that effect. The adjustment in the process of se- 
quential updating of (3 is equivalent to changing the 
loss from C(y, f(x; /?)) to £(y, f(x; f3) + 7), which we 
call the ^-adjusted loss of C. The 7-adjusted loss is 
a re-expression of C in terms of the adjusted resid- 
ual, used clS £1 conceptual aid to illustrate the effect 
of adjustment through the case-specific parameter 7 
on C. Concrete examples of the adjustments will be 
given in the following sections. Alternatively, one 
may view C Xj (y, f(x; f3)) := min 7GR {£(y, f(x; (3) + 
7) + A 7 |7|} = C(y,f(x;(3) +7) + A 7 |7| as a whole to 
be the "effective loss" in terms of f3 after profiling 
out j. The effective loss replaces C(y, f(x; f3)) for the 
modified Ai procedure. When concerned with effi- 
ciency, we often take ^(7) = IMIi = S?=i7i 2 - This 
choice has the effect of increasing the impact of se- 
lected, nonoutlying cases on the analysis. 

In subsequent sections, we will take a few standard 
statistical methods for regression and classification 
and illustrate how this general scheme applies. This 
framework allows us to see established procedures 
in a new light and also generates new procedures. 
For each method, particular attention will be paid 
to the form of adjustment to the loss function by 
the penalized case-specific parameters. 

2.2 General Algorithm for Finding Solutions 

Although the computational details for obtaining 
the solution to (1) are specific to each modeling 
procedure Ai, it is feasible to describe a common 
computational strategy which is effective for a wide 
range of procedures that optimize a convex func- 
tion. For fixed and A 7 , the solution pair of (3 
and 7 to the modified Ai can be found with little 
extra computational cost. A generic algorithm be- 
low alternates estimation of f3 and 7. Given 7, mini- 
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mization of L(J3, 7) is done via the original modeling 
procedure Ai. In most cases we consider, minimiza- 
tion of L(/3,7) given f3 entails simple adjustment of 
"residuals." These considerations lead to the follow- 
ing iterative algorithm for finding (3 and 7: 

1. Initialize 7 (0) = and (o) = argmin^ L{/3, 0) (the 
ordinary Ai solution). 

2. Iteratively alternate the following two steps, m = 
0,1,...: ' 

• 7 ( - m+1 ^ = argmin 7 gKn L(/3( m ),7) modifies "re- 
siduals." 

• /3( m+1 ) = argmin^KpL^,^" 1 -^). This step 
amounts to reapplying the Ai procedure to 
7'( rn ' +1 ) -adjusted data although the nature of 
the data adjustment would largely depend on C 

3. Terminate the iteration when ||/3( m+1 ) -/?( m ) || 2 < 
e, where e is a prespecified convergence tolerance. 

In a nutshell, the algorithm attempts to find the 
joint minimizer (/3, 7) by combining the minimiz- 
ers /3 and 7 resulting from the projected subspaces. 
Convergence of the iterative updates can be estab- 
lished under appropriate conditions. Before we state 
the conditions and results for convergence, we briefly 
describe implicit assumptions on the loss function 
and the complexity or penalty terms, «/(/) and ^2(7)- 
£(y,f(x;/3)) is assumed to be nonnegative. For sim- 
plicity, we assume that J(f) of f(x; (3) depends on j3 
only, and that it is of the form </(/) = and 
J 2 (j) = for k > 1. The LASSO penalty has k = 
1 while a ridge regression type penalty sets k = 2. 
Many other penalties of this form for J(f) can be 
adopted as well to achieve better model selection 
properties or certain desirable performance of Ai. 
Examples include those for the elastic net (Zou and 
Hastie. 2005), the grouped LASSO (Yuan and Lin, 

2006) and the hierarchical LASSO (Zhou and Zhu, 

2007) . 

For certain combinations of the loss C and the 
penalty functionals, J(f) and ^(7), more efficient 
computational algorithms can be devised, as in Has- 
tie et al. (2004), Efron et al. (2004a) and Rosset 
and Zhu (2007). However, in an attempt to pro- 
vide a general computational recipe applicable to 
a variety of modeling procedures which can be im- 
plemented with simple modification of existing rou- 
tines, we do not pursue the optimal implementation 
tailored to a specific procedure in this paper. 

Convexity of the loss and penalty terms plays a pri- 
mary role in characterizing the solutions of the it- 



erative algorithm. For a general reference to prop- 
erties of convex functions and convex optimization, 
see Rockafellar (1997). Nonconvex problems require 
different optimization strategies. 

If L(/3,7) in (1) is continuous and strictly con- 
vex in P and 7 for fixed Aa and A 7 , the minimizer 
pair (/3,7) in each step is properly defined. That 
is, given 7, there exists a unique minimizer ,$(7) := 
argmin^ L(j3, 7), and vice versa. The assumption that 
L((3, 7) is strictly convex holds if the loss C(y, f(x; (3)) 
itself is strictly convex. Also, it is satisfied when 
a convex £(y, f(x; (3)) is combined with J(f) and 
^2(7) strictly convex in ft and 7, respectively. 

Suppose that L(f3, 7) is strictly convex in (3 and 7 
with a unique minimizer (/?*, 7*) for fixed \p and A 7 . 
Then, the iterative algorithm gives a sequence of 
(YjM^M) W ith strictly decreasing L(/3( m ),7 (m) ). 

Moreover, (j3^ m \^ m ^) converges to (/3*,7*). This re- 
sult of convergence of the iterative algorithm is well 
known in convex optimization, and it is stated here 
without proof. Interested readers can find a formal 
proof in Lee, MacEachern and Jung (2007). 

3. REGRESSION 

Consider a linear model of the form yi = xj (3 + £{. 
Without loss of generality, we assume that each co- 
variate is standardized. Let X be an n xp design ma- 
trix with xj in the ith row and let Y = (y± , . . . , y n ) T . 

3.1 Least Squares Method 

Taking the least squares method as a baseline 
modeling procedure Ai , we make a link between its 
modification via case-specific parameters and a clas- 
sical robust regression procedure. 

The least squares estimator of (3 = (f3\, . . . , (3 P ) T is 
the minimizer $ G W of L{(3) = \{Y - Xf3) T (Y - 
X(3). To reduce the sensitivity of the estimator to 
influential observations, the p covariates are aug- 
mented by n case indicators. Let Zi be the indica- 
tor variable taking 1 for the ith observation and 
otherwise, and let 7 = (7i,...,7 n ) T be the coeffi- 
cients of the case indicators. The additional design 
matrix Z for Zi is the identity matrix, and Z7 be- 
comes 7 itself. The proposed modification of the 
least squares method with 72(7) = H7H1 = Y27=i |7«I 
leads to a well-known robust regression procedure. 
For the robust modification, we find f3 G R p and 
7 E W 1 that minimize 

L(P, 7) = W - (X/3 + 7 )} T {^ - (X/3 + 7)} 

(2) 

+ A 7ll7l|l, 
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where A 7 is a fixed regularization parameter con- 
straining 7. Just as the ordinary LASSO with the i\ 
norm penalty stabilizes regression coefficients by 
shrinkage and selection, the additional penalty in (2) 
has the same effect on 7, whose components gauge 
the extent of case influences. 

The minimizer 7 of L(f3,j) for a fixed /3 can be 
found by soft-thresholding the residual vector r = 
Y — Xj3. That is, % = sgn(rj)(|rj| — A 7 ) + . For obser- 
vations with small residuals, \ri\ < A 7 , % is set equal 
to zero with no effect on the current fit, and for those 
with large residuals, > A 7 , ji is set equal to the 
residual ri = yi — xj /3 offset by A 7 toward zero. Com- 
bining 7 with /3, we define the adjusted residuals to 

be r* = yi — xj (3 — jf, that is, r* = n if |rj| < A 7 , and 
r* = sgn(rj)A 7 , otherwise. Thus, introduction of the 
case-specific parameters along with the i\ penalty 
on 7 amounts to winsorizing the ordinary residu- 
als. The 7-adjusted loss is equivalent to truncated 
squared error loss which is (y — x T (3) 2 if \y — x T /3| < 
A 7 , and is A 2 otherwise. Figure 1 shows (a) the re- 
lationship between the ordinary residual r and the 



corresponding 7, (b) the residual and the adjusted 
residual r*, (c) the 7-adjusted loss as a function of r, 
and (d) the effective loss. 

The effective loss is C\ i {y,x Y f3) = (y — x T (3) 2 /2 
if \y - x T /3| < A 7 , and A 7 /2 + A 7 (|y - x T /3| - A 7 ) 
otherwise. This effective loss matches Huber's loss 
function for robust regression (Huber, 1981). As in 
robust regression, we choose a sufficiently large A 7 
so that only a modest fraction of the residuals are 
adjusted. Similarly, modification of the LASSO as 
a penalized regression procedure yields the Huber- 
ized LASSO described by Rosset and Zhu (2004). 

3.2 Location Families 

More generally, a wide class of problems can be 
cast in the form of a minimization of L(f5) = 
Y^i=i9{Vi ~ x l P) where g(-) is the negative log- 
likelihood derived from a location family. The as- 
sumption that we have a location family implies that 
the negative log-likelihood is a function only of = 
yi — xj f3. Dropping the subscript, common choices 
for the negative log-likelihood, g(r), include r 2 (least 



6 



Y. LEE, S. N. MACEACHERN AND Y. JUNG 



squares, normal distributions) and \r\ (least absolute 
deviations, Laplace distributions). 

Introducing the case-specific parameters ji, we 
wish to minimize 

n 

L(P,j) = ^2a(yi - xJ/3 - 7i) + A 7 ||tJ|i. 

i=l 

For minimization with a fixed /3, the next result ap- 
plies to a broad class of g(-) (but not to g(r) = \r\). 

Proposition 1. Suppose that g is strictly con- 
vex with the minimum at 0, and lim^-too g'(r) = 
±oo, respectively. Then, 

7 = arg min g(r — 7) + AJ7I 

'r-gf-^iXy), 

forr > 5 '- 1 (A 7 ), 
0, forg'^i-X^KrKg'- 1 ^), 

forr < 5 '~ 1 (-A 7 ). 

The proposition follows from straightforward al- 
gebra. Set the first derivative of the decoupled mini- 
mization equation equal to and solve for 7. Insert- 
ing these values for 73 into the equation for L(/3,j) 
yields 

n 

L0,Z) =^2a( r i ~7i) + A 7llllli- 
1=1 

The first term in the summation can be decom- 
posed into three parts. Large rj contribute g(ri — 
r i + 5 ,/_1 (\)) = 5'(5 ,/ ~ 1 (\))- Large, negative n con- 
tribute g{g'~ l {— A 7 )). Those 77 with intermediate 
values have % = and so contribute g{ri). Thus 
a graphical depiction of the 7-adjusted loss is much 
like that in Figure 1, panel (c), where the loss is 
truncated above. For asymmetric distributions (and 
hence asymmetric log-likelihoods), the truncation 
point may differ for positive and negative residuals. 
It should be remembered that when \ri\ is large, the 
corresponding 7, is large, implying a large contri- 
bution of ||7||i to the overall minimization problem. 
The residuals will tend to be large for vectors (3 that 
are at odds with the data. Thus, in a sense, some of 
the loss which seems to disappear due to the effec- 
tive truncation of g is shifted into the penalty term 
for 7. Hence the effective loss C\^{y, f(x;(3)) = g(y — 
f(x; 0) — 7) + A 7 |7| is the same as the original loss, 
g(y — f(x;/3)) when the residual is in [<? (— A 7 ), 



</ _1 (A 7 )] and is linear beyond the interval. The lin- 
earized part of g is joined with g such that £\ 1 is 
differentiable. 

Computationally, the minimization of L(/3,7) giv- 
en 7 entails application of the same modeling proce- 
dure Ai with g to winsorized pseudo responses y* = 
yi - % where y* = yi for c/ /_1 (-A 7 ) <n< 5 /_1 (A 7 ), 
y*=g'- 1 {X f ) for r > 5 '- 1 (A 7 ), and y* = ^(-X^) 
for r < g' (— A 7 ). So, the 7-adjusted data in Step 2 
of the main algorithm consist of (xi,y*) pairs in 
each iteration. A related idea of subsetting data and 
model-fitting to the subset iteratively for robustness 
can be found in the computer vision literature, the 
random sample consensus algorithm (Fischler and 
Bolles, 1981) for instance. 

3.3 Quantile Regression 

Consider median regression with absolute devia- 
tion loss C(y, x T /3) = \y — x T /3| , which is not covered 
in the foregoing discussion. It can be verified easily 
that the ^i-adjustment of £ is void due to the piece- 
wise linearity of the loss, reaffirming the robustness 
of median regression. For an effectual adjustment, 
the £2 norm regularization of the case-specific pa- 
rameters is considered. With the case-specific pa- 
rameters 7i, we have the following objective function 
for modified median regression: 



(3) t(/3,7) = ^l»-xT/J- 7j 



For a fixed /3 and residual r = y — x T /3, the 7 mini- 
mizing \r — 7I + 4r7 2 is given by 



1 



sgn(r)— I 



> — ) 



1 

< — 

A 7 



The 7-adjusted loss for median regression is 

1 



C(y,x T /3 + j) 



y-x T (3 



y-x 



/3|> 



A. 



as shown in Figure 3(a) below. Interestingly, this 
^2-adjusted absolute deviation loss is the same as 
the so-called "e-insensitive linear loss" for support 
vector regression (Vapnik, 1998) with e = 1/A 7 . 

With this adjustment, the effective loss is Huber- 
ized squared error loss. The £2 adjustment makes 
median regression more efficient by rounding the 
sharp corner of the loss, and leads to a hybrid pro- 
cedure which lies between mean and median regres- 
sion. Note that, to achieve the desired effect for me- 
dian regression, one chooses quite a different value 
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of A 7 than one would when adjusting squared error 
loss for a robust mean regression. 

The modified median regression procedure can be 
also combined with a penalty on ft for shrinkage 
and/or selection. Bi et al. (2003) considered support 
vector regression with the £\ norm penalty ||/3||i for 
simultaneous robust regression and variable selec- 
tion. These authors relied on the e-insensitive linear 
loss which comes out as the 7-adjusted loss of the 
absolute deviation. In contrast, we rely on the effec- 
tive loss which produces a different solution. 

In general, quantile regression (Koenker and Bas- 
sett, 1978; Koenker and Hallock, 2001) can be used 
to estimate conditional quantiles of y given x. It is 
a useful regression technique when the assumption 
of normality on the distribution of the errors e is not 
appropriate, for instance, when the error distribu- 
tion is skewed or heavy-tailed. For the qth. quantile, 
the check function p q is employed: 

(4) p(r)-{ qr ' forr>0, 

W W-\_(l_ g ) r , forr<0. 

The standard procedure for the gth quantile regres- 
sion finds ft that minimizes the sum of asymmetri- 
cally weighted absolute errors with weight q on pos- 
itive errors and weight (1 — q) on negative errors: 



i=l 

Consider modification of p q with specific pa- 

rameter 7 and £2 norm regularization. Due to the 
asymmetry in the loss, except for q = 1/2, the size 
of reduction in the loss by the case-specific parame- 
ter 7 would depend on its sign. Given ft and resid- 
ual r = y — x T ft, if r > 0, then the positive 7 would 
lower p q by qj, while if r < 0, the negative 7 with the 
same absolute value would lower the loss by (q — 1)7. 
This asymmetric impact on the loss is undesirable. 
Instead, we create a penalty that leads to the same 
reduction in loss for positive and negative 7 of the 
same magnitude. In other words, the desired £2 norm 
penalty needs to put 57+ and (1 — q)j- on an equal 
footing. This leads to the following penalty propor- 
tional to q 2 "j\ and (1 — q) 2 ^ 2 _: 

J 2 ( 7 ) := {q/(l - q)h 2 + + {(1 - q)/qW- 

When q = 1/2, ^(7) becomes the symmetric £2 norm 
of 7. 

With this asymmetric penalty, given ft, 7 is now 
defined as 

(5) argmin£ A (/3, 7 ) := p q (r - 7) + ^ J 2 ('j), 



and is explicitly given by 



— -n 



A- 



-r- )+rl[ — 7— < r < 



1-q 



+ 



1 



q -l(r> 1 



The effective loss pq is then given by 

g(i-g) 



(6) 



Pa( r ) 



(q-l)r 



2 A. 



for r < , 



A 7 1 



q 2 

— r , 



for ~Y-<r<0, 

At 



vy 

,2 



2 1-q 



for < r < 



qr 



Ki-?) 



2A 
for r > 



and its derivative is 
( 



(7) W(r) 



1. 



for r < 



Q_ 

A-- 



1-q 



for — — < r < 0, 
for < r < — — -, 



for r > 



We note that, under the assumption that the density 
is locally constant in a neighborhood of the quan- 
tile, the quantile remains the of the effective ip q 
function. 

Figure 2 compares the derivative of the check loss 
with that of the effective loss in (6). Through penal- 
ization of a case-specific parameter, p q is modified to 
have a continuous derivative at the origin joined by 
two lines with a different slope that depends on q. 
The effective loss is reminiscent of the asymmetric 
squared error loss (q(r + ) 2 + (1 — g)(r_) 2 ) consid- 
ered by Newey and Powell (1987) and Efron (1991) 
for the so-called expectiles. The proposed modifi- 
cation of the check loss produces a hybrid of the 
check loss and asymmetric squared error loss, how- 
ever, with different weights than those for expectiles, 
to estimate quantiles. The effective loss is formally 
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r r 

Fig. 2. The derivative of the check loss in the left panel, ip qi and that of the modified check loss in the right panel, ip^, for 
g = 0.2, 0.5 and 0.7. 



similar to the rounded-corner check loss of Nychka 
et al. (1995) who used a vanishingly small adjust- 
ment to speed computation. Portnoy and Koenker 
(1997) thoroughly discussed efficient computation 
for quantile regression. 

Redefining 72(7) as the sum of the asymmetric pen- 
alty for the case-specific parameter 7,, i = 1, .. . ,n, 
modified quantile regression is formulated as a pro- 
cedure that finds fi and 7 by minimizing 

(8) m 7) = E - x ~iP - + y J 2(i). 

i=l 

In extensive simulation studies (Jung, MacEach- 
ern and Lee, 2010), such adjustment of the standard 
quantile regression procedure generally led to more 
accurate estimates. See Section 5.1.1 for a summary 
of the studies. This is confirmed in the 
NHANES data analysis in Section 6.1. 

For large enough samples, with a fixed A 7 , the bias 
of the enhanced estimator will typically outweigh 
its benefits. The natural approach is to adjust the 
penalty attached to the case-specific covariates as 
the sample size increases. This can be accomplished 
by increasing the parameter A 7 as the sample size n 
grows. 

Let A 7 := cn a for some constant c and a > 0. 
The following theorem shows that if a is sufficiently 
large, the modified quantile regression estimator fin, 
which minimizes J2i=iPq(yi ~ X J P) or equivalent- 
ly (8), is asymptotically equivalent to the standard 
estimator fi n . Knight (1998) proved the asymptotic 
normality of the regression quantile estimator fi n un- 
der some mild regularity conditions. Using the ar- 
guments in Koenker (2005), we show that fin has 
the same limiting distribution as fi n , and thus it is 
-y/n-consistent if a is sufficiently large. 



Allowing a potentially different error distribution 
for each observation, let Yi,Y2, ... be independent 
random variables with c.d.f.'s F\,F%, . . . and suppose 
that each Fi has continuous p.d.f. fi. Assume that 
the gth conditional quantile function of Y given x 
is linear in x and given by x T fi(q), and let £i(q) '■= 
xj fi(q). Now consider the following regularity con- 
ditions: 

(C-l) i = 1,2, . . . , are uniformly bounded away 

from and 00 at £j. 

(C-2) i = 1,2,..., admit a first-order Taylor 

expansion at and /,'(£) are uniformly bound- 
ed at £j. 

(C-3) There exists a positive definite matrix Dq such 

that lim n _^oo n -1 Yj x i x J = A)- 
(C-4) There exists a positive definite matrix D\ such 

that lim^oo n' 1 fi(^i)xixj = D 1 . 
(C-5) maxj = i r ,. )n ||xj||/y7T— > in probability. 

(C-l) and (C-3) through (C-5) are the conditions 
considered for the limiting distribution of the stan- 
dard regression quantile estimator fi n in Koenker 
(2005) while (C-2) is an additional assumption that 
we make. 

Theorem 2. Under the conditions (C-l)-(C-5), 
if a > 1/3, then 

MfiZ - <%)) ^ N(0, g (l - q)D?D D?). 
The proof of the theorem is in the Appendix. 

4. CLASSIFICATION 

Now suppose that y^s indicate binary outcomes. 
For modeling and prediction of the binary responses, 
we mainly consider margin-based procedures such as 
logistic regression, support vector machines (Vapnik, 
1998), and boosting (Freund and Schapire, 1997). 
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Fig. 3. Modification of (a) absolute deviation loss for median regression with I2 penalty, (b) negative log-likelihood for 
logistic regression with i\ penalty, and (c) hinge loss for the support vector machine with £2 penalty. The solid lines are for 
the effective loss, the dashed lines are for the ^/-adjusted loss, and the dotted lines are for the original loss in each panel. 



These procedures can be modified by the addition 
of case indicators. 

4.1 Logistic Regression 

Although it is customary to label a binary out- 
come as or 1 in logistic regression, we instead adopt 
the symmetric labels of {—1,1} for y^s. The sym- 
metry facilitates comparison of different classifica- 
tion procedures. Logistic regression takes the neg- 
ative log-likelihood loss for estimation of logit 
fix) = log[p(x)/(l -p(x))]. The loss, £{y,f{x)) = 
log[l + exp(— yfix))], can be viewed as a function of 
the so-called margin, yf(x). This functional margin 
of yf(x) is a pivotal quantity for defining a family of 
loss functions in classification similar to the residual 
in regression. 

As in regression with continuous responses, case 
indicators can be used to modify the logit function 
fix) in logistic regression to minimize 

7) 

n 

(9) = log(l + exp(- yi {/(x i; + 7l })) 

i=l 

+ A 7 ||tJ|i, 

where /(x;/3o,/3) = Po + x T (3. When it is clear in 
context, fix) will be used as abbreviated notation 
for fix; /3o, /3), a discriminant function, and the sub- 
script i will be dropped. For fixed /3o and /3, the min- 
imization decouples, and 7, is determined by mini- 
mizing 

log(l + exp(-y i {/(x i ; / So,/3) + 7i})) + A 7 |7i|- 



First note that the minimizer 7, must have the same 
sign as yi. Letting r = yfix) and assuming that < 
A 7 < 1, we have argmin 7 >o log(l + exp(— r — 7)) + 
A 7 | 7 | = log{(l - A 7 )/A 7 } -r if r < log{(l - A 7 )/A 7 }, 
and otherwise. This yields a truncated negative 
log-likelihood given by 



C(yJ(x)) = { 



( log(l + A 7 /(l-A 7 )), 

if y/(x)<log{(l-A 7 )/A 7 }, 
log(l + exp(-y/0))), 
otherwise, 



as the 7-adjusted loss. This adjustment is reminis- 
cent of Pregibon's (1982) proposal tapering the de- 
viance function so as to downweight extreme obser- 
vations, thereby producing a robust logistic regres- 
sion. See Figure 3(b) for the 7-adjusted loss (the 
dashed line), where r]\ := log{(l — A 7 )/A 7 } is a de- 
creasing function of A 7 . A 7 determines the level of 
truncation of the loss. As A 7 tends to 1, there is 
no truncation. Figure 3(b) also shows the effective 
loss (the solid line) for the l\ adjustment, which lin- 
earizes the negative log- likelihood below r\\. 

4.2 Large Margin Classifiers 

With the symmetric class labels, the foregoing 
characterization of the case-specific parameter 7 in 
logistic regression can be easily generalized to vari- 
ous margin-based classification procedures. In classi- 
fication, potential outliers are those cases with large 
negative margins. Let ^(r) be a loss function of the 
margin r = y/(x). The following proposition, anal- 
ogous to Proposition 1, holds for a general family of 
loss functions. 
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Proposition 3. Suppose that g is convex and 
monotonically decreasing in r , and g' is continuous. 
Then, for A 7 < — lim r _ > _ 00 g'(r), 

7 = argmin g(r + 7) + AJ7I 

= (g'- 1 (-^)-r, forr<g'-\-X y ), 
10, forT>g'~ l (-\ 1 ). 

The proof is straightforward. Examples of the mar- 
gin-based loss g satisfying the assumption include 
the exponential loss g{r) = exp(— r) in boosting, the 
squared hinge loss g(r) = {(1 — t) + } 2 in the support 
vector machine (Lee and Mangasarian, 2001), and 
the negative log-likelihood g(r) = log(l + exp(— r)) 
in logistic regression. Although their theoretical tar- 
gets are different, all of these loss functions are trun- 
cated above for large negative margins when ad- 
justed by 7. Thus, the effective loss C\ (y, f(x)) = 
9(yf( x ) +7) + -^7 1 7 1 is obtained by linearizing g for 
yf{x)<g'-\-\ : ). 

The effect of 7-adjustment depends on the form 
of g, and hence on the classification method. For 
boosting, 7 = — log A 7 — yf(x) if yf(x)< — logA 7 , and 
is otherwise. This gives L(/3q, P, A f) = z 7 27=i ex P( — Ui ' 
f(xi; /3 , p) -74) = Ya=i ex P(-7i) exp(-Vif(xi; Po,P)) 
So, finding Pq and P given 7 amounts to weighted 
boosting, where the positive case-specific parame- 
ters % downweight the corresponding cases by 
exp(— 7j). For the squared hinge loss in the sup- 
port vector machine, 7 = l — yf(x) — A 7 /2 if yf(x) < 
1 — A 7 /2, and is otherwise. A positive case-specific 
parameter ji has the effect of relaxing the margin re- 
quirement, that is, lowering the joint of the hinge for 
that specific case. This allows the associated slack 
variable to be smaller in the primal formulation. 
Accordingly, the adjustment affects the coefficient 
of the linear term in the dual formulation of the 
quadratic programming problem. 

As a related approach to robust classification, Wu 
and Liu (2007) proposed truncation of margin-based 
loss functions and studied theoretical properties that 
ensure classification consistency. Similarity exists be- 
tween our proposed adjustment of a loss function 
with 7 and truncation of the loss at some point. 
However, it is the linearization of a margin-based 
loss function on the negative side that produces its 
effective loss, and minimization of the effective loss 
is quite different from minimization of the truncated 
(i.e., adjusted) loss. Linearization is more conducive 
to computation than is truncation. Application of 
the result in Bartlett, Jordan and McAuliffe (2006) 



shows that the linearized loss functions satisfy suffi- 
cient conditions for classification consistency, namely 
Fisher consistency, which is the main property in- 
vestigated by Wu and Liu (2007) for truncated loss 
functions. 

Xu, Caramanis and Mannor (2009) showed that 
regularization in the standard support vector ma- 
chine is equivalent to a robust formulation under 
disturbances of x without penalty. In contrast, un- 
der our approach, robustness of classification meth- 
ods is considered through the margin, which is anal- 
ogous to the residual in regression. This formulation 
can cover outliers due to perturbation in x as well 
as mislabeling of y. 

4.3 Support Vector Machines 

As a special case of a large margin classifier, the 
linear support vector machine (SVM) looks for the 
optimal hyperplane f(x; Pq, ft) = Pq + x T P = min- 
imizing 

n A 
(10) L x (Po,P) = £[1 - yifixi-p^P)]^ + -II/3IH, 

i=i 

where [t] + = max(i,0) and A > is a regulariza- 
tion parameter. Since the hinge loss for the SVM, 
g(r) = (1 — t)+, is piecewise linear, its linearization 
with H7H1 is void, indicating that it has little need of 
further robustification. Instead, we consider modifi- 
cation of the hinge loss with H7H2. This modification 
is expected to improve efficiency, as in quantile re- 
gression. 

Using the case indicators Z{ and their coefficients 7, , 
we modify (10), arriving at the problem of minimiz- 
ing 

n 

L{p ,p,j) = £[1 - y i {f(x i ;P ,P) + 7i}] + 

i=l 

(11) 

+ ^« + ^ii7iii. 

For fixed Pq and P, the minimizer 7 of L(/?o,/3,7) 
is obtained by solving the decoupled optimization 
problem of 

mi 5-[l - Vifixf, Po,P)- ym]+ + ^7 4 2 for each 74. 

7iSM Z 

With an argument similar to that for logistic re- 
gression, the minimizer ji should have the same sign 
as yi. Let £=l — yf. A simple calculation shows that 

argmin[£-7] + + ^ 7 2 

7>0 Z 
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0, if£<0, 

£, ifO<£<l/A. 

1/A 7 , if£>l/A 7 . 



Hence, the increase in margin y^i due to inclusion 
of 7 is given by 

{l - y,f(x,)}i (o < l - nf(xi) < 
+ i. / ( 1 _ !/(/w >i_). 

The 7-adjusted hinge loss is C(y, f(x)) = [1 — 1/A 7 — 
yf{x)}+ with the hinge lowered by 1/A 7 as shown in 
Figure 3(c) (the dashed line). The effective loss (the 
solid line in the figure) is then given by a smooth 
function with the joint replaced with a quadratic 
piece between f — f /A 7 and 1 and linear beyond the 
interval. 

5. SIMULATION STUDIES 

We present results from various numerical exper- 
iments to illustrate the effect of the proposed modi- 
fication of modeling procedures by regularization of 
case-specific parameters. 

5.1 Regression 

5.1.1 £2- adjusted quantile regression The effective- 
ness of the ^-adjusted quantile regression depends 
on the penalty parameter A 7 in (6), which yields 
(— g/A 7 , (1 — <?)/A 7 ) as the interval of quadratic ad- 
justment. 

We undertook extensive simulation studies (avail- 
able in Jung, MacEachern and Lee, 2010) to estab- 
lish guidelines for selection of the penalty parame- 
ter A 7 in the linear regression model setting. The 
studies encompassed a range of sample sizes, from 
10 2 to 10 4 , a variety of quantiles, from 0.1 to 0.9, 
and distributions exhibiting symmetry, varying de- 
grees of asymmetry, and a variety of tail behav- 
iors. The modified quantile regression method was 
directly implemented by specifying the effective ip- 
function ip q , the derivative of the effective loss, in 
the rim function in the R package. 

An empirical rule was established via a (robust) 
regression analysis. The analysis considered A 7 of 
the form c q n a /a, where c q is a constant depending 
on q and a is a robust estimate of the scale of the 
error distribution. The goal of the analysis was to 
find A 7 which, across a broad range of conditions, 
resulted in an MSE near the condition-specific min- 
imum MSE. Here MSE is defined as mean squared 



error of estimated regression quantiles at a new X 
integrated over the distribution of the covariates. 

After initial examination of the MSE with a range 
of a values, we made a decision to set a to 0.3 for 
good finite sample performance across a wide range 
of conditions. With fixed a, we varied c q to obtain 
the smallest MSE by grid search for each condition 
under consideration. For a quick illustration, Fig- 
ure 4 shows the intervals of adjustment with such 
optimal c q for various error distributions, q values, 
and sample sizes. Wider optimal intervals indicate 
that more quadratic adjustment is preferred to the 
standard quantile regression for reduction of MSE. 
Clearly, Figure 4 demonstrates the benefit of the 
proposed quadratic adjustment of quantile regres- 
sion in terms of MSE across a broad range of situa- 
tions, especially when the sample size is small. 

In general, MSE values begin to decrease as the 
size of adjustment increases from zero and increase 
after hitting the minimum, due to an increase in 
bias. There is an exception of this typical pattern 
when estimating the median with normally distribut- 
ed errors. MSE monotonically decreases in this case 
as the interval of adjustment widens, confirming the 
optimality properties of least squares regression for 
normal theory regression. The comparisons between 
sample mean and sample median can be explicitly 
found under the t error distributions using different 
degrees of freedom. The benefit of the median rela- 
tive to the mean is greater for thicker tailed distri- 
butions. We observe that this qualitative behavior 
carries over to the optimal intervals. Thicker tails 
lead to shorter optimal intervals, as shown in Fig- 
ure 4. 

Modeling the optimal condition-specific c q func- 
tion of q through a robust regression analysis led 
to the rule, with a = 0.30, of c q « 0.5 exp(— 2.118 — 
1.097<?) for q < 0.5 and c q 0.5exp(-2.118- 1.097(1 - 
q)) for q > 0.5. The simulation studies show that this 
choice of penalty parameter results in an accurate 
estimator of the quantile surface. 

5.1.2 Robust LASSO We investigated the sensi- 
tivity of the LASSO (or LARS) and its robust ver- 
sion (obtained by the proposed t\ modification) to 
contamination of the data through simulation. 

For the robust LASSO, the iterative algorithm in 
Section 2 was implemented by using LARS (Efron 
et al., 2004a) as the baseline modeling procedure and 
winsorizing the residuals with A 7 bending con- 
stant. The bending constant was taken to be scale 
invariant, so that A 7 = k&, where k is a constant 




Gamma(shape=3,scale=sqrt(3)) Log-Normal Exp(3) 




"i 1 1 1 1 r n 1 1 1 1 — n 1 1 1 1 r 

2 4 6 8 10 0.0 0.5 1.0 1.5 2.0 1 2 3 4 5 

q=(0.1 ,0.2,0.3,0.4,0.5,0.6,0.7) q=(0.1 ,0.2,0.3,0.4,0.5,0.6,0.7) q=(0.1 ,0.2,0.3,0.4,0.5,0.6,0.7) 



Fig. 4. "Optimal" intervals of adjustment for different quantiles (q), sample sizes (n), and error distributions. The intervals 
range from the quantile minus q/X-y to the quantile plus (1 — q)/A 7 , with A 7 minimizing MSE. The vertical lines in each 
distribution indicate the true quantiles. The stacked horizontal lines for each quantile are corresponding optimal intervals. 
Five intervals at each quantile are for n = 10 2 , 10 2 ' 5 , 10 3 , 10 3 ' 5 and 10 4 , respectively, from the bottom. 



and a is a robust scale estimate. The standard ro- 
bust statistics literature (Huber, 1981) suggests that 
good choices of k lie in the range from 1 to 2. 

For brevity, we report only that portion of the re- 
sults pertaining to accuracy of the fitted regression 
surface and inclusion of variates in the model when 
k = 2. Similar results were obtained for k near 2. The 
results differ for extreme values of k. Throughout the 
simulation, the standard linear model y = x T /3 + e 
was assumed. Following the simulation setting in 
Tibshirani (1996), we generated x = (x\, . . . , xg) T 



from a multivariate normal distribution with mean 
zero and standard deviation 1. The correlation be- 
tween Xi and Xj was set to p^~^ with p = 0.5. Three 
scenarios were considered with a varying degree of 
sparsity in terms of the number of nonzero true co- 
efficients: (i) sparse: /3 = (5,0,0,0,0, 0,0,0), (ii) in- 
termediate: (3 = (3, 1.5, 0, 0, 2, 0, 0, 0) and (iii) dense: 
(3j = 0.85 for all j = 1, . . . ,8. In all cases, the sam- 
ple size was 100. For the base case, £j was assumed 
to follow N(0, a 2 ) with a = 3. For potential outliers 
in e, the first 5% of the e^'s were tripled, yielding 
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Fig. 5. Mean squared error (MSE) of P for LARS and its 
robust version under three different scenarios in the simula- 
tion study. In each scenario, o, e, and x indicate clean data, 
data with contaminated measurement errors, and data with 
mismeasured first covariate. The dotted lines are for LARS 
while the solid lines are for robust LARS. The points are the 
average MSE for 100 replicates. 

a data set with more outliers. We also investigated 
sensitivity to high leverage cases. For this setting, 
we tripled the first 5% of the values of x\. Thus 
the replicates were blocked across the three settings. 
The C p criterion was used to select the model. 

Figure 5 shows mean squared error (MSE) be- 
tween the fitted and true regression surfaces, omit- 
ting intercepts. MSE is integrated across the distri- 
bution of a future X , taken to be that for the base 
case of the simulation. Over the m = 100 replicates 
in the simulation, MSE = m" 1 E J =i(/? i ~ #) T S(^ - 
/3), where $ l is the estimate of j3 for the ith repli- 
cate, and £ is the covariance matrix of X. LARS 
and robust LARS perform comparably in the base 
case, with the MSE for robust LARS being greater 
by 1 to 6 percent. For both LARS and robust LARS, 
MSE in the base case increases as one moves from 
the sparse to the dense scenario. MSE increases no- 
ticeably when e is contaminated, by a factor of 1.31 



to 1.41 for LARS. For robust LARS, the factor for 
increase over the base case with LARS is 1.12 to 
1.22. For contamination in X , results under LARS 
and robust LARS are similar in the intermediate 
and dense cases, with increases in MSE over the base 
case. For the sparse case, the coefficient of the con- 
taminated covariate, x±, is large relative to the other 
covariates. Here, robust LARS performs noticeably 
better than LARS, with a smaller increase in MSE. 

Table 1 presents results on the difference in num- 
ber of selected variables for pairs of models. In each 
pair, a contaminated model is contrasted with the 
corresponding uncontaminated model. The top half 
of the table presents results for contamination of e. 
The distribution of the differences in the number 
of selected variables for the pairs of fitted models 
has a mode at in each scenario for both LARS 
and robust LARS. There is, however, substantial 
spread around 0. The fitted models for the data 
with contaminated errors tend to have fewer vari- 
ables than those for the original data, especially in 
the dense scenario. This may well be attributed to 
inflated estimates of a 2 used in C p for the contami- 
nated data, favoring relatively smaller models. The 
effect is stronger for LARS than for robust LARS, 
in keeping with the lessened impact of outliers on 
the robust estimate of a 2 . 

The bottom half of Table 1 presents results for 
contamination of X. Again, the distributions of dif- 
ferences in model size have modes at in all scenar- 
ios. The distributions have substantial spread around 
0. Under the sparse scenario in which the contam- 
ination has a substantial impact on MSE, the dis- 
tribution under robust LARS is more concentrated 
than under LARS. 

The simulation demonstrates that the proposed 
robustification is successful in dealing with both con- 
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Distribution of difference in the number of selected variables for the fitted model to contaminated data from that to clean data 
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taminated errors and contaminated covariates. As 
expected, in contrast to LARS, robust LARS is ef- 
fective in identifying observations with large mea- 
surement errors and lessening their influence. It is 
also effective at reducing the impact of high lever- 
age cases, especially when the high leverage arises 
from a covariate with a large regression coefficient. 
The combined benefits of robustness to outliers and 
high leverage cases render robust LARS effective at 
dealing with influential cases in an automated fash- 
ion. 

5.2 Classification 

A three-part simulation study was carried out to 
examine the effect of the proposed modification of 
loss functions for classification. The primary focus is 
on (i) the efficiency of the modified SVM relative to 
the SVM with hinge loss and its smoothed version 
with quadratically modified hinge loss, and (ii) the 
robustness of logistic regression relative to modified 
logistic regression (via the linearized deviance). The 
secondary focus is on ensuring that robustness does 
not significantly degrade as efficiency is improved, 
and that efficiency does not suffer too much as ro- 
bustness is improved. 

All three parts of the simulation begin with n = 
100 cases generated from a pair of five-dimensional 
multivariate normal distributions, with identical co- 
variance matrices and equal proportions for two clas- 
ses (y = ±1). Without loss of generality, the covari- 
ance matrices were taken to be the identity. For the 
first part of the simulation, the separation between 
the two classes is fixed. The separation is deter- 
mined by the difference in means of the multivariate 
normals, which, in turn, determine the Bayes error 
rate for the underlying problem. Throughout, once 
a method was fit to the data (i.e., a discriminant 
function was obtained) , the error rate was calculated 
analytically. Each part of the simulation consisted of 
400 replicates. 

Six methods were considered in this study: LDA 
(linear discriminant analysis) baseline method 
for the normal setting, the standard SVM, its vari- 
ant with squared hinge loss (called Smooth SVM in 
Lee and Mangasarian, 2001), another variant with 
quadratically modified hinge loss (referred to as Hu- 
berized SVM in Rosset and Zhu, 2007), logistic re- 
gression, and the method with linearized binomial 
deviance (referred to as linearized LR in this study). 
The Huberized SVM and linearized LR were imple- 
mented through the fast Newton- Armijo algorithm 
proposed for Smooth SVM in Lee and Mangasarian 




Fig. 6. Mean excess error of the SVM variant with quadrat- 
ically modified hinge loss (Huberized SVM) and the method 
with linearized deviance loss (linearized LR) as the bending 
constant k varies. The gray band indicates a one standard 
error bound around the mean estimate for Huberized SVM 
from 400 replicates. The standard error for comparison of 
the Huberized SVM to another method varies, but is consid- 
erably smaller, due to the simulation design. The horizontal 
lines from top to bottom are for SVM, logistic regression and 
Smooth SVM, respectively. 

(2001). To focus on the effect of the loss functions on 
the classification error rate, no penalty was imposed 
on the parameters of discriminant functions. 

For the first part of the study, the mean vectors 
were set with a difference of 2.7 in the first coordi- 
nate and elsewhere, yielding a Bayes error rate of 
8.851%. Figure 6 compares the SVM and its vari- 
ants in terms of the average excess error from the 
Bayes error rate. The k on the x-axis corresponds 
to the bending constant, 1 — 1/A 7 in the Huber- 
ized SVM. When k is as small as —1, we see that 
quadratic modification in the Huberized SVM effec- 
tively yields the same result as Smooth SVM. As k 
tends to 1, the Huberized SVM becomes the stan- 
dard SVM. Clearly, there is a range of k values for 
which the mean error rate of the Huberized SVM is 
lower than that of the standard SVM, demonstrat- 
ing improved efficiency in classification. In fact, the 
improved efficiency of smooth versions of the hinge 
loss in the normal setting can be verified theoreti- 
cally for large sample cases, where the relative effi- 
ciency is defined as the ratio of mean excess errors. 
See Lee and Wang (2011) for details. 

Figure 6 also displays a comparison between lo- 
gistic regression and the linearized LR of Section 4, 
with bending constant k = log{(l — A 7 )/A 7 }. There 
is no appreciable difference in the excess error be- 
tween logistic regression and its linearized version 
for negative values of k. Enhancing the robustness of 
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Fig. 7. Mean excess errors of (a) logistic regression and linearized LR, and (b) SVM and its variants, as the proportion of 
perturbation varies. The gray band indicates a one standard error bound around the mean estimate for (a) linearized LR and 
(b) Huberized SVM from 400 replicates. The standard errors for comparisons are considerably smaller than indicated by the 
bands. 



logistic regression (shown in part two of the study) 
sacrifices almost none of its efficiency. 

The value of the bending constant k leading to 
the minimum error rate depends on the underlying 
problem itself, and the range of best k values may 
differ for the Huberized SVM and linearized LR. The 
results in Figure 6 suggest that values of k ranging 
from —1 to yield excellent performance for both 
procedures in this setting. 

The second part of the study focuses on robust- 
ness. To study this, we perturbed each sample by 
flipping the class labels of a certain proportion of 
cases selected at random, and applied the six proce- 
dures to the perturbed sample. The estimated dis- 
criminant rules were evaluated in the same way as 
in the setting without perturbation. 

Figure 7(a) highlights increased robustness of lin- 
earized LR (with k = —0.5) compared to logistic re- 
gression when some fraction of labels are flipped. 
As the proportion of mislabeled data increases, ex- 
cess error rises for all of the procedures, including 
the baseline method of LDA. However, the rate of 
increase in error is slower for the modified logistic re- 
gression, as the linearized deviance dampens the in- 
fluence of mislabeled cases on the discriminant rule. 

Comparison of the SVM and its variants in the 
same setting reveals a trade-off between efficiency 
and robustness. Figure 7(b) shows that the squared 
hinge loss yields a lower error rate than hinge loss 
when the perturbation fraction is less than 6%. The 
trend is reversed when the fraction is higher than 
6%. This trade-off is reminiscent of that between 



the sample mean and median as location parameter 
estimators. The Huberized SVM (with k = — 0.5) as 
a hybrid method strikes a balance between the two. 
We note that the robustness of the SVM, compared 
with its variants, is more visible when two classes 
have less overlap (not shown here). 

The third part of the study provides a compre- 
hensive comparison of the methods. Three scenarios 
with differing degree of difficulty were considered; 
"easy," "intermediate" and "hard" settings refer to 
the multivariate normal setting with the Bayes error 
rates of 2.275%, 8.851% and 15.866%, respectively. 
In addition, for scenarios with mislabeled cases, 5% 
and 10% of labels were flipped under each of the 
three settings. Two values of the bending constant 
(k = —0.5 and —1) were used for the Huberized SVM 
and the linearized LR. The results of comparison un- 
der nine scenarios are summarized in Table 2. The 
tabulated values are the mean error rates of the dis- 
criminant rules under each method. 

When there are no mislabeled cases, the smooth 
variants of the SVM improve upon the performance 
of the standard SVM. As the separation between 
classes increases, the reduction in error due to mod- 
ification of the hinge loss with fixed k diminishes. 
Linearization of deviance in logistic regression does 
not appear to affect the error rate. In contrast, when 
there are mislabeled cases, linearization of the de- 
viance renders logistic regression more robust across 
all the scenarios with differing class separations. Sim- 
ilarly, the standard SVM is less sensitive to mis- 
labeling than its smooth variants. This makes the 
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Table 2 

Mean error rates of classification methods under various settings of mean difference and perturbation fraction. The lowest 
error rates are in bold when compared among the methods of the same type (either SVM or LR) for each scenario 



Huberized SVM Linearized LR 



Scenario 


SVM 


k = -0.5 


k = -1 


Smooth SVM 


k = -0.5 


k = -1 


LR 


Easy 


0.0385 


0.0376 


0.0376 


0.0376 


0.0362 


0.0363 


0.0363 


Intermediate 


0.1028 


0.1009 


0.1008 


0.1008 


0.1014 


0.1013 


0.1013 


Hard 


0.1753 


0.1727 


0.1726 


0.1726 


0.1730 


0.1729 


0.1728 


Easy + 5% flip 


0.0348 


0.0362 


0.0371 


0.0372 


0.0383 


0.0395 


0.0411 


Intermediate + 5% flip 


0.1063 


0.1050 


0.1057 


0.1059 


0.1054 


0.1061 


0.1071 


Hard + 5% flip 


0.1790 


0.1769 


0.1773 


0.1774 


0.1772 


0.1773 


0.1778 


Easy + 10% flip 


0.0370 


0.0415 


0.0423 


0.0421 


0.0445 


0.0465 


0.0481 


Intermediate + 10% flip 


0.1107 


0.1117 


0.1127 


0.1127 


0.1125 


0.1136 


0.1150 


Hard + 10% flip 


0.1846 


0.1833 


0.1839 


0.1840 


0.1836 


0.1841 


0.1848 



SVM more preferable as the proportion of misla- 
beled cases increases. However, in the difficult prob- 
lem of little class separation, the quadratic modifi- 
cation in the Huberized SVM performs better than 
the SVM. 

6. APPLICATIONS 
6.1 Analysis of the NHANES Data 

We numerically compare standard quantile regres- 
sion with modified quantile regression for analysis of 
real data. The Centers for Disease Control and Pre- 
vention conduct the National Health and Nutrition 
Examination Survey (NHANES), a large-scale sur- 
vey designed to monitor the health and nutrition of 
residents of the United States. Many are concerned 
about the record levels of obesity in the popula- 
tion, and the survey contains information on height 
and weight of individuals, in addition to a variety 
of dietary and health-related questions. Obesity is 
defined through body mass index (BMI) in kg/m 2 , 
a measure which adjusts weight for height. In this 
analysis, we describe the relationship between height 
and BMI among the 5938 males over the age of 18 
in the aggregated NHANES data sets from 1999, 
2001, 2003 and 2005. Our analyses do not adjust for 
NHANES' complex survey design. In particular, no 
adjustment has been made for oversampling of se- 
lected groups or nonresponse. Since BMI is weight 
adjusted for height, the null expectation is that BMI 
and height are unrelated. 

We fit a nonparametric quantile regression model 
to the data. The model is a six-knot regression spline 
using the natural basis expansion. The knots (held 
constant across quantiles) were chosen by eye. The 
rule for selection of the penalty parameter A 7 de- 



scribed in Section 5.1.1 was used for the NHANES 
data analysis. 

Figure 8 displays the fits from standard (QR) and 
modified (QR.M) quantile regressions for the quan- 
tiles between 0.1 and 0.9 in steps of 0.05. The fitted 
curves show a slight upward trend, some curvature 
overall, and mildly increasing spread as height in- 
creases. There is a noticeable bump upward in the 
distribution of BMI for heights near 1.73 meters. 
The differences between the two methods of fitting 
the quantile regressions are most apparent in the 
tails, for example the 0.6th and 0.85th quantiles for 
large heights. 

The predictive performances of the standard and 
modified quantile regressions are compared in Fig- 
ure 9. To compare the methods, 10-fold cross-valida- 
tion was repeated 500 times for different splits of the 
data. Each time, a cross-validation score was com- 
puted as 

1 - 

(12) C \ = -Tp g (y l -y i ), 

i=l 

where yi is the observed BMI for an individual in the 
hold-out sample, yi is the fitted value under QR or 
QR.M, and the sum runs over the hold-out sample. 
The figure contains plots of the 500 CV scores. The 
great majority of CV scores are to the lower right 
side of the 45 degree line, indicating that the mod- 
ified quantile regression outperforms the standard 
method — even when the QR empirical risk function 
is used to evaluate performance. Mean and 1000 
times standard deviation of the CV scores for the 
methods are summarized in Table 3. 

The pattern shown in these panels is consistent 
across other quantiles (not shown here). The pattern 
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Fig. 8. Regression spline estimates of conditional BMI quantiles in steps of 0.05, from 0.1 to 0.9 for the NHANES data. 
Natural spline bases and six knots are used in each fitted curve. 



becomes a bit stronger when the QR.M empirical 
risk function is used to evaluate performance. 

Quantile regression has the property that 100 • q% 
of the responses fall at or below the fitted gth quan- 
tile surface. This does not have to hold for the mod- 



ified quantile regression fit. However, as the cross- 
validation shows, QR.M does provide a better quan- 
tile regression surface than QR. 

Modified quantile regression has an additional ad- 
vantage which is apparent for small and large heights. 




Fig. 9. Scatterplots of 10-fold CV scores from standard quantile regression (QR) and modified quantile regression (QR.M) 
at 0.25th, 0.5th and 0.9th quantiles. Regression splines with natural spline bases and six knots are fitted to the NHANES data. 
Each of 500 points represents a pair of CV scores as in (12). 
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Table 3 

Mean and (1000 times standard deviation) of CV scores at 
selected quantiles based on 500 replicates from NHANES data 



Method 


q = 0.25 


q = 0.5 


q = 0.9 


QR 


1.5040 (0.6105) 


2.0405 (0.7272) 


1.1267 (1.0714) 


QR.M 


1.5039 (0.5855) 


2.0402 (0.6576) 


1.1263 (1.0030) 


QR.L1 


1.5039 (0.8963) 


2.0393 (0.5140) 


1.1289 (0.8569) 



The standard quantile regression fits show several 
crossings of estimated quantiles, while crossing be- 
havior is reduced considerably with modified quan- 
tile regression. Crossed quantiles correspond to 
a claim that a lower quantile lies above a higher 
quantile, contradicting the laws of probability. Fig- 
ure 10 shows this behavior. Fixes for this behavior 
have been proposed (e.g., He, 1997), but we consider 
it desirable to lessen crossing without any explicit 
fix. The reduction in crossing holds up across other 
data sets that we have examined and with regression 
models that differ in their details. 

In addition, we compare both methods with i\ 
parameter-penalized quantile regression (QR.L1), 
where the estimator /3 is defined as the minimizer of 



i=i 



PqiVi 



The rq. f it . lasso function in the quantreg R pack- 
age was used for QR.L1. Keeping the same split of 



data into 90% of training and 10% of testing for each 
replicate, we have chosen A^ among 100 candidate 
values by 9-fold cross-validation. The results are in 
Table 3. 

The effect of parameter penalization differs from 
modification of the loss function. Figure 10 illus- 
trates the difference. The quantiles estimated under 
QR.L1 (with A^ chosen by 10-fold cross-validation) 
show less variation across x relative to the fitted 
median line, due to the shrinkage of each (3j to- 
ward 0. This effect is more visible for large quan- 
tiles. Such nondifferential penalty can degrade per- 
formance, unless the parameters are of comparable 
size. This adverse effect is numerically evidenced in 
the large CV score of QR.L1 for q = 0.9 in Table 3. 
For q = 0.25 and 0.5, QR.L1 yields similar results to 
the other two methods in terms of the CV scores. 

6.2 Analysis of Language Data 

Balota et al. (2004) conducted an extensive lexical 
decision experiment in which subjects were asked 
to identify whether a string of letters was an En- 
glish word or a nonword. The words were monosyl- 
labic, and the nonwords were constructed to closely 
resemble words on a number of linguistic dimen- 
sions. Two groups were studied — college students 
and older adults. The data consist of response times 
by word, averaged over the thirty subjects in each 
group. For each word, a number of covariates was 
recorded. Goals of the experiment include determin- 
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Fig. 10. Differences between fitted median line and the other fitted quantiles for standard quantile regression (QR), modified 
quantile regression (QR.M), and i\ penalized quantile regression (QR.L1) for the NHANES data. The dashed lines are the 
minimum and maximum of the observed heights. 
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ing which features of a word (i.e., covariates) affect 
response time, and whether the active features af- 
fect response time in the same fashion for college 
students and older adults. The authors make a case 
for the need to conduct and analyze studies with 
regression techniques in mind, rather than simpler 
ANOVA techniques. 

Baayen (2007) conducted an extensive analysis of 
a slightly modified data set which is available in his 
languageR package. In his analysis, he creates and 
selects variables to include in a regression model, 
addresses issues of nonlinearity, collinearity and in- 
teraction, and removes selected cases as being influ- 
ential and/or outlying. He trims a total of 87 of the 
4568 cases. The resulting model, based on "typical" 
words, is used to address issues of linguistic impor- 
tance. It includes seventeen basic covariates which 
enter the model as linear terms, a nonlinear term for 
the written frequency of a word (fit as a restricted 
cubic spline with five knots) , and an interaction term 
between the age group and the (nonlinear) written 
frequency of the word. 

We consider two sets of potential covariates for 
the model. The small set consists of Baayen's 17 
basic covariates and three additional covariates rep- 
resenting a squared term for written frequency and 
the interaction between age group and the linear 
and squared terms for written frequency. Age group 
has been coded as ±1 for the interactions. The large 
set augments these covariates with nine additional 
covariates that were not included in Baayen's final 
model. Baayen excluded some of these covariates for 
a lack of significance, others because of collinearity. 

To investigate the performance of the LASSO and 
robust LASSO, a simulation study was conducted on 
the 4568 cases in the data set. For a single repli- 
cate in the simulation, the data were partitioned 
into a training data set and a test data set. The 
various methods were fit to the training data, with 
evaluation conducted on the test data. The criteria 
for evaluation were sum of squared differences be- 
tween the fitted and observed responses, either over 
all cases in the test data or over the test data with 
the cases identified by Baayen as outliers removed. 
We refer to these criteria as predictive mean squared 
error (PMSE). 

The simulation investigated several factors, includ- 
ing the amount of training data (10% of the full 
data, 20%, 30%, etc.), the regularization parameter 
A 7 = ka, and the method used to select the model. 
Three methods were used to select the model (i.e., 
the fraction of the distance along the solution path): 



minimum C p , generalized cross-validation, and 10- 
fold cross-validation on the training data. 

The results of a 300 replicate simulation show 
a convincing benefit to use of the robust LASSO. 
The benefit of the robust LASSO is most apparent 
when k is in the "sweet spot" ranging from 1.4 or 
so to well above 2.0. As expected, for very small k 
(near 1), the robust LASSO may not perform as 
well as the LASSO. The reduction in PMSE for 
moderate values of k, both absolute and percent, 
is slightly larger when the evaluation is conducted 
after outliers (as identified by Baayen — not by the 
fitted model) have been dropped from the test data 
set. The benefit is largest for small training data sets 
and decreases as the size of the training data set in- 
creases. For large training data sets (e.g., 90% of 
the data), little test data remains for calculation of 
PMSE and the evaluation is less stable. These pat- 
terns were apparent over all three methods of model 
selection. Figure 11 shows the results for a training 
sample size of 1827 cases (40% of the data), with 
model selected by cross-validation, for a variety of 
values of k. The PMSE for the robust LASSO dips 
below the mean PMSE for the LASSO for a wide 
range of k. The figure also presents 95% confidence 
intervals, based on the 300 replicates in the simula- 
tion, for the difference between mean PMSE under 
the robust LASSO and the LASSO. The intervals 
are indicated by the vertical lines, and statistical 
significance is indicated where the lines do not over- 
lap the mean PMSE under the LASSO. The narrow- 
ing of the intervals is a consequence of the greater 
similarity of LASSO and robust LASSO fits as the 
bending constant increases. The patterns just de- 
scribed hold for both the small set of covariates and 
the large set of covariates. 

In addition to using the test data as a target, 
we studied how well the two methods could repro- 
duce Baayen's expert fit. This makes a good tar- 
get for inference, as there is evidence that humans 
can produce a better fit than automated methods 
(Yu, MacEachern and Peruggia, 2011). Taking a fit- 
ted surface as a target allows us to remove the noise 
inherent in data-based out-of-sample evaluations. 
The results from a 5000 replicate simulation study 
with a training sample size of 400 appear in Fig- 
ure 12. The criterion is sum of squared deviations 
(SSD) between the (robust) LASSO fit and Baayen's 
fit, with the sum taken over only those covariate val- 
ues contained in the test data set. The results pre- 
sented here are for models selected with the mini- 
mum C p criterion. The robust LASSO outperforms 



20 



Y. LEE, S. N. MACEACHERN AND Y. JUNG 



CO 
CO 



HI 

CO 

I <=>. 

o CD 

o 
o 




Fig. 11. Predictive mean squared error (PMSE) for the test data in the simulation study, after removal of cases identified 
by Baayen as outliers. The horizontal line is the mean PMSE for the LASSO while the points represent the mean of PMSEs 
for the robust LASSO. The vertical lines have the width of approximate 95% confidence intervals for the difference in mean 
PMSE under the LASSO and robust LASSO. Panel (a) presents results for the small set of covariates and panel (b) presents 
results for the large set of covariates. 



the LASSO over a wide range of values for k for both 
the small and large sets of covariates. 

Figures 11 and 12 reveal an interesting difference 
across targets in the behavior of the small and large 
sets of covariates. When the target is an expert fit, as 
in the second study, adding covariates not present 
in the expert's model to the pool of potential co- 
variates allows the LASSO and robust LASSO to 
produce a near-equivalent fit to the data, but with 
different coefficients for the regressors. An examina- 
tion of the variables present in the fitted models and 
their coefficients uncovers patterns. As an example, 
the two covariates "WrittenFrequency" and "Famil- 
iarity" appear in nearly all of the models for both 



the LASSO and the robust LASSO, while Baayen in- 
cludes only "WrittenFrequency" in his model, and 
these covariate(s) have negative coefficients. Sub- 
jects are able to decide that a familiar word is a word 
more quickly (and more accurately) than an unfa- 
miliar word. Although there seems to be no debate 
on whether this conceptual effect of similarity ex- 
ists, there are a variety of viewpoints on how to best 
capture the effect. Regularization methods allow one 
to include a suite of covariates to address a single 
conceptual effect, and this produces a difference be- 
tween the LASSO and robust LASSO fits on one 
hand and a least-squares, variable-selection style fit 
on the other hand. The end result is that the regular- 
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Fig. 12. Sum of squared deviations (SSD) from Baayen's fits in the simulation study. The horizontal line is the mean SSD for 
the LASSO while the points represent the mean of SSDs for the robust LASSO. The vertical lines have the width of approximate 
95% confidence intervals for the difference in mean SSD under the LASSO and robust LASSO. Panel (a) presents results for 
the small set of covariates and panel (b) presents results for the large set of covariates. 
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ized fits with the large set of covariates show greater 
departures from Baayen's fit than do regularized fits 
with the small set of covariates. In contrast, under 
the data-based target of the first study, the large set 
of covariates results in a smaller PMSE. 

7. DISCUSSION 

In the preceding sections, we have laid out an ap- 
proach to modifying modeling procedures. The ap- 
proach is based on the creation of case-specific co- 
variates which are then regularized. With appropri- 
ate choices of penalty terms, the addition of these 
covariates allows us to robustify those procedures 
which lack robustness and also allows us to improve 
the efficiency of procedures which are very robust, 
but not particularly efficient. The method is fully 
compatible with regularized estimation methods. In 
this case, the case-specific covariates are merely in- 
cluded as part of the regularization. The techniques 
are easy to implement, as they often require lit- 
tle modification of existing software. In some cases, 
there is no need for modification of software, as one 
merely feeds a modified data set into existing rou- 
tines. 

The motivation behind this work is a desire to 
move relatively automated modeling procedures in 
the direction of traditional data analysis (e.g., Weis- 
berg, 2004). An important component of this type of 
analysis is the ability to take different looks at a data 
set. These different looks may suggest creation of 
new variates and differential handling of individual 
cases or groups of cases. Robust methods allow us to 
take such a look, even when data sets are large. Cou- 
pling robust regression techniques with the ability to 
examine an entire solution path provides a sharper 
view of the impact of unusual cases on the analysis. 
A second motivation for the work is the desire to 
improve robust, relatively nonparametric methods. 
This is accomplished by introducing case-specific pa- 
rameters in a controlled fashion whereby the finite 
sample performance of estimators is improved. 

The perspective provided by this work suggests 
several directions for future research. Adaptive penal- 
ties, whether used for robustness or efficiency, can be 
designed to satisfy specified invariances. The asym- 
metric £2 penalty for modified quantile regression 
was designed to satisfy a specified invariance. For 
a locally constant residual density, it keeps the 
of the tpq function invariant as the width of the 
interval of adjustment varies. Specific, alternative 
forms of invariance for quantile regression are sug- 
gested by consideration of parametric driving forms 



for the residual distribution. A motivating paramet- 
ric model, coupled with invariance of the of the ipq 
function to the size of the penalty term A 7 , yields 
a path of penalties. Increasing the size of the covari- 
ate-specific penalty at an appropriate rate leads to 
asymptotic equivalence with the quantile regression 
estimator. This allows one to fit the model nonpar a- 
metrically while tapping into an approximate para- 
metric form to enhance finite sample performance. 
Similarly, when case-specific penalties are applied 
to a model such as the generalized linear model, the 
asymmetry of the likelihood, coupled with invari- 
ance, suggests an asymmetric form for the l\ penalty 
used to enhance robustness of inference. 

Following development of the technique for quan- 
tile regression, one can apply the adaptive loss para- 
digm for model assessment and selection. For ex- 
ample, in cross-validation, a summary of a model's 
fit is computed as an out-of-sample estimate of em- 
pirical risk and the evaluation is used for choosing 
the model (parameter value) with the smallest esti- 
mated risk. For model averaging, estimated risks are 
converted into weights which are then attached to 
model-specific predictions that are then combined 
to yield an overall prediction. The use of modified 
loss functions for estimation of risks is expected to 
improve stability and efficiency in model evaluation 
and selection. 

APPENDIX 

Proof of Theorem 2. Let ui := yi — xj j3(q) and 
consider the objective function 

n 

(13) Z~<{5) := - xJS/y/n) ~ Pq (ui)}. 

i=i 

Note that Z2(5) is minimized at S n := y/n(pn — /3(q)), 
and the limiting distribution of 5 n is determined by 
the limiting behavior of Z2(S). To study the limit of 
Z2.(S), decompose Z%(6) as 

n 

z li 5 ) = J2{Pq( u i ~ xJS/y/n) - p q {Ui - xj8/^)} 
i=l 

n 

+ J2{pq( u i ~ x J 5 /Vn) ~ Pg(ui)} 

i=l 

n 

= ^{Pq( u i ~ xlS/y/n) - p g {ui - xJS/y/n)} 

i=l 

+ Z n (S), 

where Z n (5) := Y17=iiP<i( u i ~ X J 5 /V™) ~ Pq( u i)}- 
By showing that the first sum converges to zero in 
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probability up to a sequence of constants that do Taking C n := — q(l — q)/(2cn a x ) + q(l — q)/ 

not depend on S, we will establish the asymptotic (6c 2 n 2a ) ]P™ =1 we have that 

equivalence of pn to j3 n . 
Given A 7 = cn a , first observe that 

E{p 1 q (u i - xj5/y/n) - p q (ui - xJd/^/n)} 



+ q{l-q)/2\ 1 

(l-g)/A 7 +xT '5/y/E 



2 1-q 



xj 5 



u 



u 



+ 



g(l-g) 

2A-, 



■fi(£i + u)du 



+ 



-q/Xy+xJ 5/ \/n 



A 7 1 



x^ 



x7<5 



+ 



9(1 



2A 7 



fi{£i + u)du 



2 1-g 

xJS 



n 



fi{ii + u)du 



+ 



xJS/y/n. 



A 7 1 



g 



xj5 



u 



^ A-v 



• fi(£i + u) du. 
Using a first-order Taylor expansion of /j at £j from 
the condition (C-2) and the expression above, we 
have 

n 

E /^{Pq( u i ~ xJ&/Vn) -p q {ui - xj5/y/n)} 



+ nq(l-q)/2\ 1 

g(i-g: " 

6c 2 n 2a 



i=l 
-2a+l/2\ 



g(l-g) ^ /;feK T J 

g c 2 n 2a 



t=l 



J? 



E ^2{pq( u i ~ xj8/Vn) ~ p q {Ui - xJS/y/n)} - C n 
t=l 

-►0 if a > 1/4. 
Similarly, it can be shown that 

n 

Vax^{p^(-Ui - xJS/y/n) - p q {ui - xj8/y/n)} 

g 2 (i-g) 2 /*(&) 



t=i 

n 
i=l 



+ o(n" 



-3a+l> 



20c 3 n 3a 
-►0 if a > 1/3. 
Thus, if a > 1/3, 

n 

- xj 5/y/n) - pq(Ui - xj5/^) - C n 

1=1 

—7-0 in probability. 

This implies that the limiting behavior of Z%(5) 
is the same as that of Z n {5). From the proof of 

Theorem 4.1 in Koenker (2005), Z n (5) A -5 T W + 
±5 T L>i<5, where W ~ N(0,q(l - q)D ). By the con- 
vexity argument in Koenker (2005) (see also Pollard, 
1991; Hjort and Pollard, 1993; Knight, 1998), 5 n , 
the minimizer of Zn(5), converges to <5o := D^ 1 W , 
the unique minimizer of — 5 T W + D\5 in distri- 
bution. This completes the proof. 
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